Preserve ESI tags verbatim during processing #7
Merged
This PR improves the handling of ESI (Edge Side Includes) tags during HTML5 parsing and serialization, including support for ESI tags that are self-closing, span HTML element boundaries, or contain characters that are not HTML-encoded, such as `&`.

Approach taken

Each ESI tag is wrapped in an HTML comment that the HTML5 parser treats as atomic, so its contents cannot be altered along the way (for example, `&` becoming `&amp;`). The original tags are preserved verbatim inside the comments and restored exactly during post-processing.
Important: During processing, ESI tags appear as Comment nodes in the DOM, not as Elements. If RewriteHandler transformations move or delete these comment nodes, the final result may not match expectations.
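The wrap-and-restore round trip can be sketched roughly as follows. This is a minimal Python illustration, not the code from this PR: the function names `protect` and `restore`, the regular expressions, and the exact layout of the comment (including where the `html5-tagrewriter` marker token sits) are assumptions made for the example. It also assumes ESI tags contain neither `>` inside attribute values nor the sequence `-->`.

```python
import re

# Assumption: the marker token's position and the comment layout are
# illustrative only; the PR description does not spell out the exact format.
MARKER = "html5-tagrewriter"

# Matches opening, closing and self-closing ESI tags, e.g.
# <esi:include src="..." />, <esi:remove>, </esi:remove>.
# Simplification: attribute values must not contain '<' or '>'.
ESI_TAG = re.compile(r"</?esi:[a-zA-Z]+(?:\s[^<>]*)?>")

# Matches the comments produced by protect() and captures the original tag.
WRAPPED = re.compile(r"<!--esi " + re.escape(MARKER) + r" (.*?)-->", re.DOTALL)


def protect(html: str) -> str:
    """Wrap every ESI tag in a comment so an HTML parser sees one atomic node."""
    return ESI_TAG.sub(lambda m: f"<!--esi {MARKER} {m.group(0)}-->", html)


def restore(html: str) -> str:
    """Replace each marker comment with the original tag text, unchanged."""
    return WRAPPED.sub(lambda m: m.group(1), html)


sample = (
    '<p><esi:include src="/nav?a=1&b=2" /></p>'
    "<esi:remove><div></esi:remove>"
)
protected = protect(sample)
restored = restore(protected)
```

Between `protect` and `restore`, an HTML5 parser and serializer would see only comment nodes, which is why the unencoded `&` in the `src` attribute survives the round trip in this sketch.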
We use the ESI comment syntax defined in Section 3.7 of the ESI specification (`<!--esi ... -->`) to hide ESI tags from the HTML5 parser, but include an extra `html5-tagrewriter` marker token.

Why is this approach necessary?
ESI tags present multiple challenges for HTML5 parsing:
Self-closing syntax: ESI tags like `<esi:include src="..." />` use self-closing syntax, which HTML5 ignores on non-void elements. The parser treats them as opening tags, causing incorrect nesting.

Arbitrary interleaving: ESI tags can span across HTML element boundaries, for instance opening inside one element and closing outside it. HTML5 parsers would "repair" such structures, breaking the intended ESI behavior.

Attribute encoding: HTML5 serializers encode special characters (`&` → `&amp;`), but ESI processors work on a text basis and expect the original characters.

What does the ESI standard say?
The ESI Language Specification 1.0 describes ESI as an "XML-based markup language" (Section 1). However, the standard also explicitly states:
ESI elements can be arbitrarily interleaved with the underlying content, and that content does not even need to be HTML. The standard makes no statements about whether HTML entities must be applied to attribute values.
Since parsing ESI-containing documents with an XML parser is likely not possible anyway, assuming XML encoding rules (`&amp;`) is not warranted. The safest approach is to preserve ESI tags verbatim.
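To illustrate why XML parsing is not a realistic option, the following Python sketch feeds two fragments to the standard library's XML parser. The fragments, the helper name `is_well_formed_xml`, and the wrapping `<root>` element are made up for this example; the namespace URI is used only to bind the `esi:` prefix so that any failure comes from the structure itself.

```python
import xml.etree.ElementTree as ET

# Used here only so the esi: prefix is bound inside the wrapper element.
ESI_NS = "http://www.edge-delivery.org/esi/1.0"


def is_well_formed_xml(fragment: str) -> bool:
    """Return True if the fragment parses as XML inside a root binding esi:."""
    wrapped = f'<root xmlns:esi="{ESI_NS}">{fragment}</root>'
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True


# Hypothetical interleaving: each branch opens a <div> that is closed
# only after the ESI construct has ended.
interleaved = (
    "<esi:choose>"
    "<esi:when test=\"$(HTTP_COOKIE{group})=='a'\"><div class=\"a\"></esi:when>"
    "<esi:otherwise><div class=\"b\"></esi:otherwise>"
    "</esi:choose>"
    "shared content</div>"
)

# An attribute value carrying a raw ampersand, as a text-based ESI
# processor would expect it.
raw_ampersand = '<esi:include src="/nav?a=1&b=2" />'

# The same tag with XML encoding applied.
encoded_ampersand = '<esi:include src="/nav?a=1&amp;b=2" />'

results = {
    "interleaved": is_well_formed_xml(interleaved),
    "raw_ampersand": is_well_formed_xml(raw_ampersand),
    "encoded_ampersand": is_well_formed_xml(encoded_ampersand),
}
```

Only the encoded variant is well-formed XML, and it is exactly the form that changes what a text-based ESI processor receives.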